AITopics | text region

Collaborating Authors

text region

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding

Agarwal, Amit, Panda, Srikant, Pachauri, Kulbhushan

arXiv.org Artificial IntelligenceNov-13-2025

In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as handling OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with less than 90M parameters, making it well-suited for complex real-world applications for Information Extraction (IE) tasks where computational resources are limited. We demonstrate FS-DAG's capability through extensive experiments for information extraction task, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance. Code : https://github.com/oracle-samples/fs-dag

data mining, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2505.1733

Country: Europe > Austria (0.28)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.89)
Information Technology > Data Science > Data Mining > Text Mining (0.69)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

A U-Net and Transformer Pipeline for Multilingual Image Translation

Sahay, Siddharth, Agarwal, Radhika

arXiv.org Artificial IntelligenceOct-28-2025

This paper presents an end-to-end multilingual translation pipeline that integrates a custom U-Net for text detection, the Tesseract engine for text recognition, and a from-scratch sequence-to-sequence (Seq2Seq) Transformer for Neural Machine Translation (NMT). Our approach first utilizes a U-Net model, trained on a synthetic dataset , to accurately segment and detect text regions from an image. These detected regions are then processed by Tesseract to extract the source text. This extracted text is fed into a custom Transformer model trained from scratch on a multilingual parallel corpus spanning 5 languages. Unlike systems reliant on monolithic pre-trained models, our architecture emphasizes full customization and adaptability. The system is evaluated on its text detection accuracy, text recognition quality, and translation performance via BLEU scores. The complete pipeline demonstrates promising results, validating the viability of a custom-built system for translating text directly from images.

machine learning, natural language, translation, (17 more...)

arXiv.org Artificial Intelligence

2510.23554

Country:

North America > United States (0.69)
Europe (0.68)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding

Rie Johnson, Tong Zhang

Neural Information Processing SystemsOct-2-2025, 12:24:11 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > New Jersey > Middlesex County > Piscataway (0.04)
Asia > China > Beijing > Beijing (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.83)

Add feedback

Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models

Son, Jaemin, Choi, Sujin, Yun, Inyong

arXiv.org Artificial IntelligenceSep-9-2025

Recent progress in vision-language models (VLMs) has led to impressive results in document understanding tasks, but their high computational demands remain a challenge. To mitigate the compute burdens, we propose a lightweight token pruning framework that filters out non-informative background regions from document images prior to VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to enhance spatial coherence. Experiments on real-world document datasets demonstrate that our approach substantially lowers computational costs, while maintaining comparable accuracy.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2509.06415

Country: Asia > South Korea (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Add feedback

Cure or Poison? Embedding Instructions Visually Alters Hallucination in Vision-Language Models

Wang, Zhaochen, Wang, Yiwei, Cai, Yujun

arXiv.org Artificial IntelligenceAug-5-2025

Vision-Language Models (VLMs) often suffer from hallucination, partly due to challenges in aligning multimodal information. We propose Prompt-in-Image, a simple method that embeds textual instructions directly into images. This removes the need for separate text inputs and forces the model to process all content through the visual channel. We evaluate this method on three popular open-source VLMs: Qwen2.5-VL, LLaVA-1.5, and InstructBLIP. The results reveal sharp differences. Prompt-in-Image improves Qwen2.5-VL's performance, increasing POPE accuracy by 4.1 percent (from 80.2 percent to 84.3 percent) and also reducing hallucination rates on MS-COCO. In contrast, LLaVA-1.5 and InstructBLIP experience a severe performance drop, with accuracy falling from around 84 percent to near-random levels. Through detailed analysis, we found that CLIP-based encoders in LLaVA and InstructBLIP exhibit excessive attention bias toward embedded text regions, disrupting visual understanding. In contrast, Qwen's vision encoder handles text-embedded images robustly. Crucially, Prompt-in-Image reduces Qwen's modality gap, enhancing cross-modal alignment by unifying information processing through a single modality.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2508.01678

Country: Europe > Switzerland (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition

Chakraborty, Ritabrata, Palaiahnakote, Shivakumara, Pal, Umapada, Liu, Cheng-Lin

arXiv.org Artificial IntelligenceMar-19-2025

Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. In such cases, the deployment of heavy models becomes impractical due to constraints on memory, computational resources, and latency. To address these challenges, we propose a novel, training-free plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computations. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level, improving downstream recognition. Instead of performing traditional text detection that follows a block-level comparison between feature map and source image and harnesses contextual information using pretrained captioners, allowing the framework to generate word predictions directly from scene context.Candidate texts are semantically and lexically evaluated to get a final score. Predictions that meet or exceed a pre-defined confidence threshold bypass the heavier process of end-to-end text STR profiling, ensuring faster inference and cutting down on unnecessary computations. Experiments on public benchmarks demonstrate that our paradigm achieves performance on par with state-of-the-art systems, yet requires substantially fewer resources.

machine learning, natural language, recognition, (17 more...)

arXiv.org Artificial Intelligence

2503.15639

Country:

Asia > India > West Bengal > Kolkata (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > United Kingdom > England > Greater Manchester > Salford (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)

Add feedback

TextDestroyer: A Training- and Annotation-Free Diffusion Method for Destroying Anomal Text from Images

Li, Mengcheng, Lin, Mingbao, Chao, Fei, Lin, Chia-Wen, Ji, Rongrong

arXiv.org Artificial IntelligenceNov-1-2024

In this paper, we propose TextDestroyer, the first training- and annotation-free method for scene text destruction using a pre-trained diffusion model. Existing scene text removal models require complex annotation and retraining, and may leave faint yet recognizable text information, compromising privacy protection and content concealment. TextDestroyer addresses these issues by employing a three-stage hierarchical process to obtain accurate text masks. Our method scrambles text areas in the latent start code using a Gaussian distribution before reconstruction. During the diffusion denoising process, self-attention key and value are referenced from the original latent to restore the compromised background. Latent codes saved at each inversion step are used for replacement during reconstruction, ensuring perfect background restoration. The advantages of TextDestroyer include: (1) it eliminates labor-intensive data annotation and resource-intensive training; (2) it achieves more thorough text destruction, preventing recognizable traces; and (3) it demonstrates better generalization capabilities, performing well on both real-world scenes and generated images.

machine learning, natural language, textdestroyer, (18 more...)

arXiv.org Artificial Intelligence

2411.00355

Country:

Asia > China > Fujian Province > Xiamen (0.05)
Asia > Taiwan (0.04)
Asia > Singapore (0.04)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

A Simple and Effective Temporal Grounding Pipeline for Basketball Broadcast Footage

Harris, Levi

arXiv.org Artificial IntelligenceOct-30-2024

We present a reliable temporal grounding pipeline for video-to-analytic alignment of basketball broadcast footage. Given a series of frames as input, our method quickly and accurately extracts time-remaining and quarter values from basketball broadcast scenes. Our work intends to expedite the development of large, multi-modal video datasets to train data-hungry video models in the sports action recognition domain. Our method aligns a pre-labeled corpus of play-by-play annotations containing dense event annotations to video frames, enabling quick retrieval of labeled video segments. Unlike previous methods, we forgo the need to localize game clocks by fine-tuning an out-of-the-box object detector to find semantic text regions directly. Our end-to-end approach improves the generality of our work. Additionally, interpolation and parallelization techniques prepare our pipeline for deployment in a large computing cluster. All code is made publicly available.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2411.00862

Country: North America > United States > North Carolina (0.04)

Genre: Research Report (0.40)

Industry: Leisure & Entertainment > Sports (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.57)

Add feedback

A Novel Framework For Text Detection From Natural Scene Images With Complex Background

Kaladagi, Basavaraj, Pujari, Jagadeesh

arXiv.org Artificial IntelligenceSep-15-2024

Recognizing texts from camera images is a known hard problem because of the difficulties in text detection from the varied and complicated background. In this paper we propose a novel and efficient method to detect text region from images with complex background using Wavelet Transforms. The framework uses Wavelet Transformation of the original image in its grayscale form followed by Sub-band filtering. Then Region clustering technique is applied using centroids of the regions, further Bounding box is fitted to each region thus identifying the text regions. This method is much sophisticated and efficient than the previous methods as it doesn't stick to a particular font size of the text thus, making it generalized. The sample set used for experimental purpose consists of 50 images with varying backgrounds. Images with edge prominence are considered. Furthermore, our method can be easily customized for applications with different scopes.

background, text detection, text region, (10 more...)

arXiv.org Artificial Intelligence

2409.09635

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.69)

Add feedback

SegHist: A General Segmentation-based Framework for Chinese Historical Document Text Line Detection

Hu, Xingjian, Wei, Baole, Gao, Liangcai, Wang, Jun

arXiv.org Artificial IntelligenceJul-8-2024

Text line detection is a key task in historical document analysis facing many challenges of arbitrary-shaped text lines, dense texts, and text lines with high aspect ratios, etc. In this paper, we propose a general framework for historical document text detection (SegHist), enabling existing segmentation-based text detection methods to effectively address the challenges, especially text lines with high aspect ratios. Integrating the SegHist framework with the commonly used method DB++, we develop DB-SegHist. This approach achieves state-of-theart (SOTA) on the IACC2022CHDAC (CHDAC), MTHv2, and competitive results on ICDAR2019HDRC Chinese (HDRC) datasets, with a significant improvement of 1.19% on the most challenging CHDAC dataset which features more text lines with high aspect ratios. Moreover, our method attains SOTA on rotated MTHv2 and rotated HDRC, demonstrating its rotational robustness.

aspect ratio, detection, text line, (12 more...)

arXiv.org Artificial Intelligence

2406.15485

Country:

Asia > China > Beijing > Beijing (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Pattern Recognition (0.48)

Add feedback